Code
library(tidyverse)
library(gganimate)
library(knitr)
library(gifski)
library(broom)
library(kableExtra)library(tidyverse)
library(gganimate)
library(knitr)
library(gifski)
library(broom)
library(kableExtra)Previous code from project proposal:
life_expectancy <- read.csv("lex.csv")
gdp_per_capita <- read.csv("gdp_pcap.csv")
clean_life_expectancy <- life_expectancy |>
# calculating proportion of missing values for each row
# source: https://www.statology.org/rowmeans-in-r/
filter(rowMeans(across(-country, is.na)) <= 0.2) |>
# gsub: https://www.statology.org/gsub-r/
# rename: https://dplyr.tidyverse.org/reference/rename.html
rename_with(~ gsub("^X", "lex_", .), starts_with("X"))
convert_k <- function(vec){
has_k <- str_detect(vec, "k")
vec_numeric <- as.numeric(str_replace(vec, "k", ""))
return(vec_numeric * ifelse(has_k, 1000, 1))
}
clean_gdp_per_capita <- gdp_per_capita |>
rename_with(~ gsub("^X", "gdp_per_capita_", .), starts_with("X")) |>
# all columns type -> numeric
mutate(across(-country, convert_k))
long_life_expectancy <- clean_life_expectancy |>
pivot_longer(
cols = starts_with("lex_"),
names_to = "year",
names_prefix = "lex_",
values_to = "life_expectancy"
)
long_gdp_per_capita <- clean_gdp_per_capita |>
pivot_longer(
cols = starts_with("gdp_per_capita_"),
names_to = "year",
names_prefix = "gdp_per_capita_",
values_to = "gdp_per_capita"
)
clean_long_df <- long_life_expectancy |>
inner_join(long_gdp_per_capita,
by = c("country", "year")) clean_long_df |>
filter(year == 2017) |>
ggplot(aes(x = gdp_per_capita, y = life_expectancy)) +
geom_jitter(alpha = 0.5, size = 1) +
labs(
x = "GDP per Capita (2017 US$ PPP)",
y = "",
title = "GDP per Capita vs. Life Expectancy in 2017 Globally",
subtitle = "Life Expectancy (in years)") +
theme_linedraw()In this first above relationship, depicting GDP per Capita and Life Expectancy Globally in 2017, we see a positive relationship between the two variables. At first, as GDP per Capita increases, life expectancy increases dramatically. Then, as GDP per Capita reaches increases further, towards the bound of our graph, the positive relationship is less clear and there are diminishing returns in terms of life expectancy. This pattern is simlar to the graph of a logarithm. This graph shows an intensely unequal world, with GDP per Capita varying greatly for countries of varying development and wealth. Additionally, we see an unequal spread of life expectancy even among wealthier countries, where wealth inequality and various healthcare systems likely come into play.
clean_long_df$year <- as.numeric(clean_long_df$year)
animated_over_time <- clean_long_df |>
filter(year <= 2017) |>
ggplot(aes(x = gdp_per_capita, y = life_expectancy)) +
geom_jitter(alpha = 0.5, size = 1) +
labs(
x = "GDP per Capita (2017 US$ PPP)",
y = "",
title = "GDP per Capita vs. Life Expectancy: Year {round(frame_time)}",
subtitle = "Life Expectancy (in years)") +
theme_linedraw() +
transition_time(year) +
ease_aes("linear")
# Source: https://gganimate.com/
# should show in the "viewer" tab and also when rendered in html
animate(animated_over_time, renderer = gifski_renderer(), nframes = 150, duration = 20)This animation shows the same graph from above (GDP per Capita vs Life Expectancy) over the years 1800-2017. The value here comes from our ability to see an increasing gap between various country’s wealth levels (and life expectancy as a result), but also in the overall life expectancy rising across the globe. This chart reinforces the idea presented above of ever-increasing wealth inequality globally, with some countries exploding in wealth and others staying nearly the same, all from relatively similar starting points in the year 1800. This animation also shows that the logarithmic pattern is followed with a medium strength.
Before running the linear regression, summarising the data to one x value and one y value per country in the years 1939 to 1945 would be a great way to simplify the regression model and ensuring each country is represented consistently. We decided to investigate 1939 to 1945 because these are the years the Second World War occured.
The linear regression contains two variables of interest: average GDP per capitia for each country during the Second World War (explanatory variable: \(X\)) and average life expectancy for each country during the Second World War (response variable: \(Y\)).
options(scipen = 999) # disable scientific notation!
summarised_lm <- clean_long_df |>
filter(year %in% 1939:1945) |>
group_by(country) |>
summarize(
avg_gdp_per_capita = mean(gdp_per_capita, na.rm = TRUE),
avg_life_expectancy = mean(life_expectancy, na.rm = TRUE)
) |>
ungroup()
# Get Regression Model:
lm_model <- lm(avg_life_expectancy ~ avg_gdp_per_capita,
data = summarised_lm)From the linear regression model, the population regression model is represented by this equation:
\[ Y = 35.321 + 0.0022X + \epsilon \]
Where \(\epsilon\) ~ \(\textit{N}(0, \sigma)\) accounts for random noise.
summary_var <- lm_model |>
augment() |>
# we only need three columns
select(avg_life_expectancy, .fitted, .resid) |>
pivot_longer(cols = c(avg_life_expectancy, .fitted, .resid),
names_to = "variables",
values_to = "values"
) |>
map_at("variables", as.factor) |>
bind_cols() |>
# change the variable names
mutate(
variables = fct_recode(variables,
"Response" = "avg_life_expectancy",
"Fitted" = ".fitted",
"Residual" = ".resid")
) |>
# make table fancy
group_by(variables) |>
summarize(variance = round(var(values), 2)) |>
kable(col.names = c("Variable Name", "𝛔̂²"),
caption = "Table 1: Variability of the Regression Model",
align = "c") |>
kable_classic(full_width = F, html_font = "Cambria")
summary_var| Variable Name | 𝛔̂² |
|---|---|
| Fitted | 43.29 |
| Residual | 51.17 |
| Response | 94.46 |
In Table 1, the estimated variances have been calculated for the predicted life expectancy (fitted values), the residuals, and the actual life expectancy (response values). First, the variance of the response values represents the total amount of variation in life expectancy across observations. Second, we have the variance of the fitted value, which captures how much of the variability in life expectancy is explained by GDP per capita. Third, is the residual variance, which represents the unexplained variability - that is, the portion of life expectancy variation that GDP per capita does not account for.
To determine the proportion of the variability in the response values that was accounted in our model, we would first need to calculate the R² , which is done by doing this:
\[ R^2 = \frac{\hat{\sigma}^2_{\text{Fitted}}}{\hat{\sigma}^2_{\text{Response}}} \]
R2 <- 43.29 / 94.46
round(R2 * 100, 2)[1] 45.83
Based on the result, our model explains about 45.83% of the variability in life expectancy using GDP per capita. This means that 54.17% of variability remains unexplained. With an R² of 45.83%, the quality of our model is moderate. This suggest that the model is useful but not highly predictive. Although it will give us an insight into the relationship between economic prosperity and life expectancy, it lacks other factors needed for a highly accurate prediction. In other words, GDP per capita is not the sole determining factor for a person’s life expectancy. There are other things to consider as well, such as healthcare access, education, environmental conditions, and government policies.
set.seed(537) # reproducibility
rand_error <- function(x, mean = 0, sd) {
x + rnorm(length(x), mean = mean, sd = sd)
}
pred_life <- predict(lm_model)
est_sigma <- sigma(lm_model)
sim_response <- tibble(simulated_response = rand_error(pred_life, sd = est_sigma))
full_data <- summarised_lm |>
select(avg_gdp_per_capita, avg_life_expectancy) |>
bind_cols(sim_response)
full_data |>
ggplot(aes(x = avg_gdp_per_capita, y = simulated_response)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Simulated Data GDP vs Life Expectancy, 1939 - 1945",
x = "GDP Per Capita",
y = " ",
subtitle = "Simulated Life Expectancy") +
theme_linedraw() +
xlim(0, 20000) +
ylim(0, 100)
full_data |>
ggplot(aes(x = avg_gdp_per_capita, y = avg_life_expectancy)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Observed GDP vs Life Expectancy, 1939 - 1945",
x = "GDP per Capita",
y = "",
subtitle = "Actual Life Expectancy") +
theme_linedraw() +
xlim(0, 20000) +
ylim(0, 100)